Objectives: using spatial data analysis methods to study the Life Expectancy across various countries, aiming to uncover patterns and disparities between groups of countries.
Potential analysis methods: Using Exploratory Data Analysis (Descriptive and Inference Analysis) and Machine Learning for Spatial Prediction.
This dataset examines the geographic and socioeconomic factors influencing life ex- pectancy across various countries and years (from 2000 to 2015).
Source: Kaggle( link )
For spatial attribute (latitude and longitude), these value are merge from a countries coordinates dataset (link) based on Country name
Feature descriptions:
library(knitr)
library(kableExtra)
# Create a data frame with key attributes
key_attributes <- data.frame(
Feature = c("Country", "Year", "Status", "Longitude", "Latitude",
"Life expectancy", "Adult Mortality", "Infant deaths",
"Alcohol", "Percentage expenditure", "Hepatitis B",
"Measles", "BMI", "Under-five deaths", "Polio",
"Total expenditure", "Diphtheria", "HIV/AIDS", "GDP",
"Population", "Thinness 1-19 years", "Thinness 5-9 years",
"Income composition of resources", "Schooling"),
Description = c(
"Name of the country",
"Year of observation",
"Urban or rural status",
"Longitude for each country",
"Latitude for each country",
"Life expectancy at birth in years (average period that a person may expect to live)",
"Probability of dying between 15 and 60 years per 1000",
"Number of infant deaths per 1000 population",
"Alcohol consumption, measured as liters per capita",
"Expenditure on health as a percentage of GDP",
"Hepatitis B immunization coverage among 1-year-olds (%)",
"Number of reported measles cases per 1000 population",
"Average Body Mass Index of the population",
"Number of deaths under age five per 1000 population",
"Polio immunization coverage among 1-year-olds (%)",
"Total government health expenditure as a percentage of GDP",
"Diphtheria tetanus toxoid and pertussis immunization coverage among 1-year-olds (%)",
"Deaths per 1,000 live births due to HIV/AIDS (0-4 years)",
"Gross Domestic Product per capita (in USD)",
"Population of the country",
"Prevalence of thinness among children and adolescents aged 10-19 (%)",
"Prevalence of thinness among children aged 5-9 (%)",
"Human Development Index in terms of income composition of resources (0 to 1)",
"Number of years of schooling"
)
)
# Display table
kable(key_attributes, col.names = c("Feature", "Description"), caption = "Key Features") %>%
kable_styling(bootstrap_options = c("striped", "hover", "bordered"), full_width = FALSE) %>%
column_spec(1, bold = TRUE, border_right = TRUE) %>%
column_spec(2, width = "60%") %>%
row_spec(0, bold = TRUE, background = "#f7f7f7")
|
Feature |
Description |
|---|---|
|
Country |
Name of the country |
|
Year |
Year of observation |
|
Status |
Urban or rural status |
|
Longitude |
Longitude for each country |
|
Latitude |
Latitude for each country |
|
Life expectancy |
Life expectancy at birth in years (average period that a person may expect to live) |
|
Adult Mortality |
Probability of dying between 15 and 60 years per 1000 |
|
Infant deaths |
Number of infant deaths per 1000 population |
|
Alcohol |
Alcohol consumption, measured as liters per capita |
|
Percentage expenditure |
Expenditure on health as a percentage of GDP |
|
Hepatitis B |
Hepatitis B immunization coverage among 1-year-olds (%) |
|
Measles |
Number of reported measles cases per 1000 population |
|
BMI |
Average Body Mass Index of the population |
|
Under-five deaths |
Number of deaths under age five per 1000 population |
|
Polio |
Polio immunization coverage among 1-year-olds (%) |
|
Total expenditure |
Total government health expenditure as a percentage of GDP |
|
Diphtheria |
Diphtheria tetanus toxoid and pertussis immunization coverage among 1-year-olds (%) |
|
HIV/AIDS |
Deaths per 1,000 live births due to HIV/AIDS (0-4 years) |
|
GDP |
Gross Domestic Product per capita (in USD) |
|
Population |
Population of the country |
|
Thinness 1-19 years |
Prevalence of thinness among children and adolescents aged 10-19 (%) |
|
Thinness 5-9 years |
Prevalence of thinness among children aged 5-9 (%) |
|
Income composition of resources |
Human Development Index in terms of income composition of resources (0 to 1) |
|
Schooling |
Number of years of schooling |
Load Libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
## Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE
library(spdep)
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
library(GWmodel)
## Loading required package: robustbase
## Loading required package: sp
## Loading required package: Rcpp
## Welcome to GWmodel version 2.4-2.
library(tmap)
library(ggplot2)
library(dplyr)
# Read datasets
df <- read.csv("LifeExpectancy.csv")
coordinates <- read.csv("coordinate.csv")
# Merge datasets based on country name
df <- merge(df, coordinates[, c("name", "latitude", "longitude")],
by.x = "Country", by.y = "name", all.x = TRUE)
df$ID <- paste(df$Country, df$Year, sep = "_")
# Compute missing value counts and percentages
na_counts <- colSums(is.na(df[, !names(df) %in% c("ID")]))
total_counts <- nrow(df)
na_percentage <- (na_counts / total_counts) * 100 # Convert to percentage
# Convert to data frame for visualization
na_df <- data.frame(Feature = names(na_counts), NA_Count = na_counts, NA_Percentage = na_percentage)
pastel_colors <- c("#FBB4AE", "#B3CDE3", "#CCEBC5", "#DECBE4", "#FED9A6", "#FFFFCC",
"#E5D8BD", "#FDDAEC", "#F2F2F2", "#A6D854", "#66C2A5", "#FC8D62",
"#8DA0CB", "#E78AC3", "#FFD92F", "#A6CEE3", "#B2DF8A", "#FB9A99",
"#CAB2D6", "#FFED6F", "#BC80BD", "#80B1D3", "#FFB3B3", "#BEBADA")
# Plot missing values as a horizontal bar chart with percentage
ggplot(na_df, aes(x = NA_Percentage, y = reorder(Feature, NA_Percentage), fill = Feature)) +
geom_bar(stat = "identity", show.legend = FALSE) +
labs(title = "Missing Values of Features ",
x = "Percentage of Missing Values (%)",
y = "Feature") +
scale_x_continuous(breaks = seq(0, 100, by = 5)) + # Add x-axis ticks every 5%
scale_fill_manual(values = pastel_colors) +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.background = element_rect(fill = "white"),
axis.text.y = element_text(size = 10)) # Adjust text size for readability
Data size before cleaning:
print(dim(df))
## [1] 2928 25
Data size after cleaning:
df <- df %>% drop_na() # Remove NA values
df$Country <- as.factor(df$Country) # Convert categorical variables
df$Status <- as.factor(df$Status)
print(dim(df))
## [1] 1602 25
In this part, we focus on the key feature: Life Expectancy.
Methodology: doing descriptive analysis then inference analysis to have an overview understanding about the general trend of this feature –> Conclusion: which other features can have an effect on Life Expectancy.
Using summary statistics to understand the general trends and giving quantitative insight.
# Load necessary libraries
library(dplyr)
library(knitr)
library(kableExtra) # For enhanced table formatting
# Compute summary statistics for Life Expectancy
summary_table <- df %>%
group_by(Status) %>%
summarise(
Count = n(),
Mean = round(mean(Life.expectancy, na.rm = TRUE), 3),
Standard_Deviation = round(sd(Life.expectancy, na.rm = TRUE), 3),
Min = min(Life.expectancy, na.rm = TRUE),
`Percentile 25%` = round(quantile(Life.expectancy, 0.25, na.rm = TRUE), 3),
`Percentile 50%` = round(median(Life.expectancy, na.rm = TRUE), 3),
`Percentile 75%` = round(quantile(Life.expectancy, 0.75, na.rm = TRUE), 3),
Max = max(Life.expectancy, na.rm = TRUE)
) %>%
bind_rows(
df %>% summarise(
Status = "All Countries",
Count = n(),
Mean = round(mean(Life.expectancy, na.rm = TRUE), 3),
Standard_Deviation = round(sd(Life.expectancy, na.rm = TRUE), 3),
Min = min(Life.expectancy, na.rm = TRUE),
`Percentile 25%` = round(quantile(Life.expectancy, 0.25, na.rm = TRUE), 3),
`Percentile 50%` = round(median(Life.expectancy, na.rm = TRUE), 3),
`Percentile 75%` = round(quantile(Life.expectancy, 0.75, na.rm = TRUE), 3),
Max = max(Life.expectancy, na.rm = TRUE)
)
)
# Transpose the table
summary_table_transposed <- as.data.frame(t(summary_table))
# Rename columns with country status
colnames(summary_table_transposed) <- summary_table_transposed[1, ] # First row as column names
summary_table_transposed <- summary_table_transposed[-1, ] # Remove the first row
# Print transposed summary table with better styling
kable(summary_table_transposed, caption = "Summary Statistics of Life Expectancy") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE) %>%
column_spec(1, border_right = TRUE, bold = TRUE) %>% # Add vertical border and bold first column
row_spec(0, bold = TRUE) # Bold header row
|
Developed |
Developing |
All Countries |
|
|---|---|---|---|
|
Count |
242 |
1360 |
1602 |
|
Mean |
78.692 |
67.629 |
69.300 |
|
Standard_Deviation |
4.273 |
8.465 |
8.904 |
|
Min |
69.9 |
44.0 |
44.0 |
|
Percentile 25% |
75.550 |
62.475 |
64.200 |
|
Percentile 50% |
78.95 |
69.20 |
71.70 |
|
Percentile 75% |
81.4 |
73.9 |
75.0 |
|
Max |
89 |
89 |
89 |
For other features:
summary(df)
## Country Year Status Life.expectancy
## Afghanistan: 16 Min. :2000 Developed : 242 Min. :44.0
## Albania : 16 1st Qu.:2005 Developing:1360 1st Qu.:64.2
## Armenia : 15 Median :2008 Median :71.7
## Austria : 15 Mean :2008 Mean :69.3
## Belarus : 15 3rd Qu.:2011 3rd Qu.:75.0
## Belgium : 15 Max. :2015 Max. :89.0
## (Other) :1510
## Adult.Mortality infant.deaths Alcohol percentage.expenditure
## Min. : 1.0 Min. : 0.00 Min. : 0.010 Min. : 0.00
## 1st Qu.: 76.0 1st Qu.: 1.00 1st Qu.: 0.730 1st Qu.: 37.84
## Median :149.0 Median : 3.00 Median : 3.710 Median : 146.18
## Mean :168.3 Mean : 33.32 Mean : 4.506 Mean : 713.77
## 3rd Qu.:227.0 3rd Qu.: 23.00 3rd Qu.: 7.327 3rd Qu.: 527.30
## Max. :723.0 Max. :1600.00 Max. :17.870 Max. :18961.35
##
## Hepatitis.B Measles BMI under.five.deaths
## Min. : 2.00 Min. : 0 Min. : 2.00 Min. : 0.0
## 1st Qu.:74.00 1st Qu.: 0 1st Qu.:19.23 1st Qu.: 1.0
## Median :89.00 Median : 15 Median :43.80 Median : 4.0
## Mean :79.16 Mean : 2275 Mean :38.15 Mean : 45.3
## 3rd Qu.:96.00 3rd Qu.: 368 3rd Qu.:55.80 3rd Qu.: 32.0
## Max. :99.00 Max. :131441 Max. :77.10 Max. :2100.0
##
## Polio Total.expenditure Diphtheria HIV.AIDS
## Min. : 3.00 Min. : 0.740 Min. : 2.00 Min. : 0.100
## 1st Qu.:79.00 1st Qu.: 4.393 1st Qu.:81.00 1st Qu.: 0.100
## Median :92.00 Median : 5.860 Median :92.00 Median : 0.100
## Mean :83.29 Mean : 5.960 Mean :84.06 Mean : 2.028
## 3rd Qu.:97.00 3rd Qu.: 7.480 3rd Qu.:97.00 3rd Qu.: 0.700
## Max. :99.00 Max. :14.390 Max. :99.00 Max. :50.600
##
## GDP Population thinness..1.19.years
## Min. : 1.68 Min. :3.400e+01 Min. : 0.100
## 1st Qu.: 462.24 1st Qu.:2.125e+05 1st Qu.: 1.600
## Median : 1620.57 Median :1.425e+06 Median : 3.000
## Mean : 5642.83 Mean :1.453e+07 Mean : 4.827
## 3rd Qu.: 4787.85 3rd Qu.:7.532e+06 3rd Qu.: 7.000
## Max. :119172.74 Max. :1.294e+09 Max. :27.200
##
## thinness.5.9.years Income.composition.of.resources Schooling
## Min. : 0.100 Min. :0.0000 Min. : 4.20
## 1st Qu.: 1.600 1st Qu.:0.5060 1st Qu.:10.30
## Median : 3.100 Median :0.6750 Median :12.30
## Mean : 4.886 Mean :0.6314 Mean :12.12
## 3rd Qu.: 7.100 3rd Qu.:0.7520 3rd Qu.:14.00
## Max. :28.200 Max. :0.9360 Max. :20.70
##
## latitude longitude ID
## Min. :-38.42 Min. :-175.198 Length:1602
## 1st Qu.: -1.94 1st Qu.: -8.224 Class :character
## Median : 15.45 Median : 23.881 Mode :character
## Mean : 16.43 Mean : 15.585
## 3rd Qu.: 39.40 3rd Qu.: 46.869
## Max. : 60.13 Max. : 179.414
##
Histogram for Life Expectancy distribution: l
Left-skewed: the left tail extends towards 40-50 years, the right tail extends to 85-90 years.
Bimodal pattern: slightly dip before the highest peak –> One group with high life expectancy (70-80 years) and lower life expectancy (50-60 years)
ggplot(df, aes(x = Life.expectancy)) +
geom_histogram(aes(y=after_stat(density)),binwidth = 2, fill = "#6A8CAF", color = "white", alpha = 0.8) +
geom_density(color = "blue", size = 0.5)+
labs(title = "Distribution of Life Expectancy", x = "Life Expectancy (Years) ", y = "Count") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white"),
plot.title = element_text(hjust = 0.5))
Box plot of Life Expectancy between group of countries:
Developed countries: higher, less variability
Developing countries: lower, more variability
# Life expectancy between developed and developing country
ggplot(df, aes(x = Status, y = Life.expectancy, fill = Status)) +
geom_boxplot(alpha = 0.7, outlier.shape = 16, outlier.size = 2, whiskerwidth = 1 ) +
scale_fill_manual(values = c("Developed" = "#AEC6CF", "Developing" = "#FFDAC1")) +
labs(title = "Life Expectancy in Developed vs. Developing Countries",
x = "Status",
y = "Life Expectancy",
fill = "Country Status") + # Legend title
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white"),
plot.title = element_text(hjust = 0.5, face = "bold", size = 9),
legend.position = "none")
# Filter data for developing countries
df_developing <- df[df$Status == "Developing", ]
# Plot histogram
ggplot(df_developing, aes(x = Life.expectancy)) +
geom_histogram(binwidth = 2, fill = "#FFDAC1", color = "black", alpha = 0.7) +
labs(title = "Histogram of Life Expectancy in Developing Countries",
x = "Life Expectancy (Years)",
y = "Frequency") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
developed_countries <- df[df$Status == "Developing"& df$Life.expectancy<50,]
id_list <- developed_countries$ID
library(ggplot2)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
# Columns to compare
columns_to_plot <- c(
"percentage.expenditure", "Total.expenditure",
"GDP", "Income.composition.of.resources",
"Schooling", "Alcohol"
)
# Subset data for id_list and the rest of the data
developed_data <- df[df$ID %in% id_list, columns_to_plot]
other_data <- df[!df$ID %in% id_list, columns_to_plot]
# Create a new column to indicate the groups (id_list vs. rest of the data)
developed_data$Group <- "Outlier"
other_data$Group <- "Normal"
# Combine both datasets
combined_data <- rbind(developed_data, other_data)
# Reshape the data for plotting
combined_data_melted <- melt(combined_data, id.vars = "Group",
variable.name = "Feature", value.name = "Value")
# Custom subplot names
feature_names <- c(
"percentage.expenditure" = "Health Expenditure (%)",
"Total.expenditure" = "Total Expenditure (%)",
"GDP" = "GDP",
"Income.composition.of.resources" = "Income Composition",
"Schooling" = "Years of Schooling",
"Alcohol" = "Alcohol Consumption"
)
# Create the box plots
ggplot(combined_data_melted, aes(x = Group, y = Value, fill = Group)) +
geom_boxplot(outlier.size = 0.8, width = 0.3) + # Smaller boxplots
facet_wrap(~ Feature, scales = "free_y", ncol = 3, labeller = as_labeller(feature_names)) +
scale_fill_manual(values = c("Outlier" = "#FBB4AE", "Normal" = "#B3CDE3")) + # Pastel colors
labs(title = "Comparison of Selected Features between Outliers and Normal group",
x="",
y = "Value") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.background = element_rect(fill = "white"),
axis.text.x = element_text(size = 8), # Smaller x-axis labels
axis.text.y = element_text(size = 7), # Smaller y-axis labels
strip.text = element_text(size = 10), # Smaller subplot titles
plot.title = element_text(size = 14, hjust = 0.5, face = "bold"),
panel.spacing = unit(1, "lines"))
library(corrplot)
## corrplot 0.95 loaded
# Select only numeric columns (Base R)
numeric_cols <- sapply(df, is.numeric) # Identify numeric columns
corr_matrix <- cor(df[, numeric_cols], use = "complete.obs") # Compute correlation
# Plot the correlation matrix
corrplot(corr_matrix, tl.col = "darkblue", tl.srt = 45, method = "color")
# Add title
title("Correlation Matrix of Features", col.main = "black", font.main = 2, cex.main = 1.5)
Goals: Quantify how strong these features related together in a linear relationship.
In this report, we chose 4 features which have a positive effect on Life Expectancy: GDP, Total Expenditure, Income Composition of Resources and Schooling.
library(knitr)
# Compute correlations
correlations <- data.frame(
Variable = c("GDP", "Total Expenditure", "Income Composition of Resources", "Schooling"),
Pearson_Correlation = c(
cor(df$Life.expectancy, df$GDP, use = "complete.obs", method = "pearson"),
cor(df$Life.expectancy, df$Total.expenditure, use = "complete.obs", method = "pearson"),
cor(df$Life.expectancy, df$Income.composition.of.resources, use = "complete.obs", method = "pearson"),
cor(df$Life.expectancy, df$Schooling, use = "complete.obs", method = "pearson")
)
)
# Display table
kable(correlations, col.names = c("Variable", "Pearson Correlation"), caption = "Correlation with Life Expectancy")
| Variable | Pearson Correlation |
|---|---|
| GDP | 0.4432124 |
| Total Expenditure | 0.1810506 |
| Income Composition of Resources | 0.7251987 |
| Schooling | 0.7320096 |
The result shows that:
Moderate positive correlation between Life expectancy and GDP = higher GDP is generally associated with longer life expectancy.
Weak positive correlation between Life Expectancy and Total Expenditure = government health spending has a small effect. This might be because factors like efficiency of spending matter more.
Strong positive correlation between Life Expectancy and Income Composition of Resources = a country’s economic equality (human development factors like education, income distribution) has a significant effect on life expectancy.
df$Income.composition.norm <- scale(df$Income.composition.of.resources)
model <- lm(Life.expectancy ~ GDP + Total.expenditure + Income.composition.norm + Schooling, data = df)
summary(model)
##
## Call:
## lm(formula = Life.expectancy ~ GDP + Total.expenditure + Income.composition.norm +
## Schooling, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.3218 -2.4649 0.6011 3.3994 17.7492
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.379e+01 1.016e+00 52.965 < 2e-16 ***
## GDP 7.018e-05 1.389e-05 5.054 4.83e-07 ***
## Total.expenditure -1.290e-02 6.296e-02 -0.205 0.838
## Income.composition.norm 3.326e+00 2.290e-01 14.527 < 2e-16 ***
## Schooling 1.253e+00 8.295e-02 15.101 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.626 on 1597 degrees of freedom
## Multiple R-squared: 0.6017, Adjusted R-squared: 0.6007
## F-statistic: 603.1 on 4 and 1597 DF, p-value: < 2.2e-16
The result show that:
Multiple R-squared = 0.6017: the model explains 60.17% of the variance in life expectancy
F-statistic = 603.1, p < 2.2e-16: the model is highly significant, meaning at least one predictor strongly affects life expectancy.
p-value: GDP and Income composition are significant effect in life expectancy (p-value <0.001). Especially Income composition with 1 unit increase in the variable results in a 3.3-year increase in life expectancy. On the other hand, Total expenditure has no significant effect on life expectancy at 5% level (p-value > 0.05)
Relationship between features
# Load necessary library
library(ggplot2)
# Define custom pastel colors
point_color <- "#56B4E9" # Pastel blue for points
line_color <- "#E69F00" # Pastel orange for regression line
fill_color <- "#E69F00" # Light pastel orange for confidence interval
# GDP vs. Life Expectancy (Linear Regression)
p1 <- ggplot(df, aes(x = GDP, y = Life.expectancy)) +
geom_point(alpha = 0.5, color = point_color) + # Transparent pastel blue points
geom_smooth(method = "lm", color = line_color, fill = fill_color, alpha = 0.3) + # Regression line with shading
labs(title = "GDP vs. Life Expectancy",
x = "GDP per Capita",
y = "Life Expectancy") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white"),
plot.title = element_text(hjust = 0.5, face = "bold", size = 13),
legend.position = "none")
# Total Expenditure vs. Life Expectancy (Linear Regression)
p2 <- ggplot(df, aes(x = Total.expenditure, y = Life.expectancy)) +
geom_point(alpha = 0.5, color = point_color) +
geom_smooth(method = "lm", color = line_color, fill = fill_color, alpha = 0.3) +
labs(title = "Healthcare Expenditure vs. Life Expectancy",
x = "Total Healthcare Expenditure",
y = "Life Expectancy") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white"),
plot.title = element_text(hjust = 0.5, face = "bold", size = 13),
legend.position = "none")
# Income Composition of Resources vs. Life Expectancy (Linear Regression)
p3 <- ggplot(df, aes(x = Income.composition.norm, y = Life.expectancy)) +
geom_point(alpha = 0.5, color = point_color) +
geom_smooth(method = "lm", color = line_color, fill = fill_color, alpha = 0.3) +
labs(title = "Income Composition vs. Life Expectancy",
x = "Income Composition of Resources",
y = "Life Expectancy") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white"),
plot.title = element_text(hjust = 0.5, face = "bold", size = 13),
legend.position = "none")
# Schooling vs. Life Expectancy (Linear Regression)
p4 <- ggplot(df, aes(x = Schooling, y = Life.expectancy)) +
geom_point(alpha = 0.5, color = point_color) +
geom_smooth(method = "lm", color = line_color, fill = fill_color, alpha = 0.3) +
labs(title = "Schooling vs. Life Expectancy",
x = "Schooling",
y = "Life Expectancy") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white"),
plot.title = element_text(hjust = 0.5, face = "bold", size = 13),
legend.position = "none")
# Display plots together
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(p1, p2, p3,p4, ncol = 1) # Arrange plots in a single row
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Moran’s I statistic measures how similar neighboring observations are.
Result interpretation:
p-value < 0.001 → Statistically significant, meaning life expectancy is not randomly distributed but follows a spatial pattern
Moran’s I = 0.921 → Strong positive spatial autocorrelation (high/low life expectancy regions cluster together).
library(leaflet)
library(sf)
library(rnaturalearth)
library(rnaturalearthdata)
##
## Attaching package: 'rnaturalearthdata'
## The following object is masked from 'package:rnaturalearth':
##
## countries110
library(spdep) # For spatial clustering
library(dplyr)
# Convert dataframe to an sf spatial object
df_sf <- st_as_sf(df, coords = c("longitude", "latitude"), crs = 4326)
# Slightly jitter coordinates to avoid identical points issue
set.seed(42)
df_sf$geometry <- st_jitter(df_sf$geometry, amount = 1e-5)
# Load world map data
world <- ne_countries(scale = "medium", returnclass = "sf") # Keep separate!
# Define color palette based on Life Expectancy
pal <- colorNumeric(palette = "viridis", domain = df_sf$Life.expectancy)
# Create spatial neighbors using k-nearest neighbors
coords <- st_coordinates(df_sf)
neighbors <- knearneigh(coords, k = 5)
nb <- knn2nb(neighbors)
# Convert to spatial weights
weights <- nb2listw(nb, style = "W")
# Compute Moran's I test for spatial clustering
morans_test <- moran.test(df_sf$Life.expectancy, listw = weights)
print(morans_test) # Check p-value to confirm clustering
##
## Moran I test under randomisation
##
## data: df_sf$Life.expectancy
## weights: weights
##
## Moran I statistic standard deviate = 62.203, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.9210103270 -0.0006246096 0.0002195275
# Create an interactive leaflet map
leaflet() %>%
addProviderTiles(providers$CartoDB.Positron) %>% # Add basemap
# Add world map as a background layer (optional)
addPolygons(data = world, fillOpacity = 0.1, color = "gray", weight = 1) %>%
# Add life expectancy points
addCircleMarkers(
data = df_sf,
radius = 5,
color = ~pal(Life.expectancy),
stroke = FALSE,
fillOpacity = 0.8,
popup = ~paste0("Country: ", Country, "<br>",
"Life Expectancy: ", round(Life.expectancy, 2))
) %>%
# Add legend
addLegend(
"bottomright",
pal = pal,
values = df_sf$Life.expectancy,
title = "Life Expectancy",
opacity = 1
) %>%
addScaleBar(position = "bottomleft")
Goals: to predict high-risk regions (regions where life expectancy is in the bottom 25%) based on geographical and socioeconomic information. This is framed as a binary classification problem, where:
1 = high-risk region (low life expectancy).
0 = low-risk region (higher life expectancy).
Methodology: Using Random Forest model which can handles non-linear relationships data. Since life expectancy can have non-linear dependencies on other features.
Evaluation: using ROC curve
library(caret) # Machine learning utilities
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest) # Random Forest model
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(ggplot2) # Visualization
library(sf) # Spatial data handling
library(dplyr) # Data manipulation
# Define High-Risk Regions (Bottom 25% Life Expectancy)
threshold <- quantile(df$Life.expectancy, 0.25, na.rm = TRUE)
df$HighRisk <- ifelse(df$Life.expectancy < threshold, 1, 0)
# Ensure HighRisk is a Factor
df$HighRisk <- factor(df$HighRisk, levels = c(0, 1))
# Select Features for Prediction
df_ml <- subset(df, select = c(HighRisk, GDP, percentage.expenditure, Population, Schooling, Income.composition.of.resources, latitude, longitude))
# Train-Test Split (80% Train, 20% Test)
set.seed(42)
trainIndex <- createDataPartition(df_ml$HighRisk, p = 0.8, list = FALSE)
trainData <- df_ml[trainIndex, ]
testData <- df_ml[-trainIndex, ]
# Train Random Forest Model
rf_model <- randomForest(HighRisk ~ ., data = trainData, ntree = 500, mtry = 3, importance = TRUE)
# Model Evaluation on Test Data
predictions <- predict(rf_model, testData)
# Ensure predictions and test labels have same factor levels
testData$HighRisk <- factor(testData$HighRisk, levels = c(0, 1))
predictions <- factor(predictions, levels = c(0, 1))
# Compute Confusion Matrix
conf_matrix <- confusionMatrix(predictions, testData$HighRisk)
print(conf_matrix) # Accuracy, Precision, Recall
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 235 5
## 1 5 74
##
## Accuracy : 0.9687
## 95% CI : (0.9431, 0.9849)
## No Information Rate : 0.7524
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9159
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9792
## Specificity : 0.9367
## Pos Pred Value : 0.9792
## Neg Pred Value : 0.9367
## Prevalence : 0.7524
## Detection Rate : 0.7367
## Detection Prevalence : 0.7524
## Balanced Accuracy : 0.9579
##
## 'Positive' Class : 0
##
# Feature Importance
importance(rf_model)
## 0 1 MeanDecreaseAccuracy
## GDP 12.383482 11.84097 18.84949
## percentage.expenditure 7.766477 10.45633 13.64698
## Population 12.100128 10.57706 16.01370
## Schooling 21.332472 26.37147 34.55436
## Income.composition.of.resources 30.667872 65.25777 61.73929
## latitude 31.625872 39.81425 47.70672
## longitude 29.957300 35.45906 39.17277
## MeanDecreaseGini
## GDP 30.56907
## percentage.expenditure 18.14733
## Population 18.67553
## Schooling 89.64216
## Income.composition.of.resources 212.40755
## latitude 54.36241
## longitude 56.20256
varImpPlot(rf_model)
# Check if both classes are predicted
table(predictions)
## predictions
## 0 1
## 240 79
table(testData$HighRisk)
##
## 0 1
## 240 79
# Visualization: Map of Predicted High-Risk Regions
df_sf <- st_as_sf(df, coords = c("longitude", "latitude"), crs = 4326) # Convert to Spatial Object
df_sf$Predicted_Risk <- predict(rf_model, df_ml)
cm <- conf_matrix$table
# True Positives (TP), False Negatives (FN), False Positives (FP), and True Negatives (TN)
TP <- cm[2, 2] # High-Risk correctly predicted as High-Risk
FN <- cm[1, 2] # Non-High-Risk incorrectly predicted as High-Risk
FP <- cm[2, 1] # High-Risk incorrectly predicted as Non-High-Risk
TN <- cm[1, 1] # Non-High-Risk correctly predicted as Non-High-Risk
# True Positive Rate (TPR) and False Positive Rate (FPR)
TPR <- TP / (TP + FN) # True Positive Rate
FPR <- FP / (FP + TN) # False Positive Rate
# Print TPR and FPR
cat("True Positive Rate (TPR):", TPR, "\n")
## True Positive Rate (TPR): 0.9367089
cat("False Positive Rate (FPR):", FPR, "\n")
## False Positive Rate (FPR): 0.02083333
ggplot() +
geom_sf(data = df_sf, aes(color = as.factor(Predicted_Risk))) +
scale_color_manual(values = c("#B3CDE3", "#FFB3BA"), labels = c("Low-Risk","High-Risk")) +
labs(title = "Predicted High-Risk Regions", color = "Risk Level") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
pred_probs <- predict(rf_model, testData, type = "prob")[,2] # Probabilities for High-Risk class (1)
# Compute ROC Curve
roc_obj <- roc(testData$HighRisk, pred_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC Curve
ggplot(data = data.frame(TPR = roc_obj$sensitivities,
FPR = 1 - roc_obj$specificities)) +
geom_line(aes(x = FPR, y = TPR), color = "#CDB7F6", size = 1) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey") +
labs(title = paste("ROC Curve"),
x = "False Positive Rate ",
y = "True Positive Rate ") +
theme_minimal()
Result interpretation:
Having an overview about Life Expectancy between various region and which factor can impact it, using multiples Exploratory Data Analysis Method and Spatial Analysis Method.
Adapting a machine learning model to building a classifier for high-risk region, using geographical and socioeconomic features. These model can be helpful in determining strategy for improving Life expectancy within countries.